{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Grocery sales data :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Data is avaiable in form of **star schema** (central table and other tables with metadata). Sometimes we have **snowflake schema**. \n",
    "* To reduce the space -- \n",
    "    * Define datatypes before reading data (read datatypes by loading only first 2 rows of data in bash/ use `shuf` to get random sample of data\n",
    "    * Use feather format\n",
    "    * `onpromotion` saved as `object` because it is boolean with NAs --> Replaced later in code and converted to bool\n",
    "* EDA of data --\n",
    "    * Use only last few weeks/months data\n",
    "    * Transform `sales` to log as Kaggle is checking accuracy on `room mean square log`\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interpretetion from Random Forest result :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Confidence interval :\n",
    "\n",
    "* Standard deviation of trees of prediction gives how confident we are for the predicitons\n",
    "* Use **Parallel trees** (fastai) --> To stack trees parallely (specific to random forest)\n",
    "* Feedback tips --> Group confidence intervals and look at which group (categorical feature) is contributing to low prediction accurac"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature importance :\n",
    "\n",
    "Which features matter in random forest ?\n",
    "\n",
    "* Use `rf_feat_importance` and plot top features based on their importance\n",
    "\n",
    "What to do with important features --\n",
    "* Gather domain knowledge about important feature\n",
    "* Redo random forest using only top features (select a cutoff threshold) --> It will make model **slightly** better and speed up modeling speed\n",
    "* We want **independent** variables for interpretetion\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A technique for working with features:\n",
    "* **Shuffling rows** of one column one by one, keeping same rf model and doing predictions to find out how the results change"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Questions :\n",
    "\n",
    "* **How does 1 decision tree in default random forest takes sub sample or does it train on complete data ? **   \n",
    "*Answer*- If `bootstraping` = False then it takes all samples without replacement. So it will have all the raws. If `bootstraping` = True then it will take len(df) rows but with replacement. So there will be duplicates which make each tree different. Default is *True*  \n",
    "  \n",
    "\n",
    "* **Because Kaggle is validating on `log error`, does that necessarily mean that log of our independent variable is dependent on dependent variables ?**   \n",
    "*Answer*- The way Random Forests are built is invariant to monotonic transformations of the independent variables  \n",
    "  \n",
    "\n",
    "* **Why not put `oob_score` while using `set_rf_sample` ?**  \n",
    "*Answer*- Because we are passing full data to random forest model and trees are built on subset of data defined in `set_rf_sample`. So each tree will try to calculcate score on remaining data (len(df) - set_rf_sample) which will consume more time. \n",
    "    \n",
    "    \n",
    "* **What did he talk about data leakage problem ?**    \n",
    "*Answer*- Sometimes, we make model on some variables which might not be available during real time predictions. This causes data leakage.  \n",
    "  \n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### HW :\n",
    "\n",
    "* Get insights about predictors by playing around with data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
